[MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)#37190
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Code Review
This pull request introduces a dynamic LRU cache for MoE expert weights, a valuable feature for reducing GPU memory consumption. The implementation is well-structured, adding new configurations, a dedicated LRU cache class, and integrating it into the MoE layer. The new tests for correctness are also a great addition. My main feedback focuses on a performance issue within the LRU cache implementation itself, which could be optimized for better efficiency, especially with larger cache sizes.
| for expert_id in unique_ids: | ||
| if expert_id in self._expert_to_slot: | ||
| self._lru_order.remove(expert_id) | ||
| self._lru_order.append(expert_id) | ||
| self.hits += 1 | ||
| else: | ||
| if self._free_slots: | ||
| slot = self._free_slots.pop() | ||
| else: | ||
| evicted = self._lru_order.pop(0) | ||
| slot = self._expert_to_slot.pop(evicted) | ||
|
|
||
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | ||
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | ||
|
|
||
| self._expert_to_slot[expert_id] = slot | ||
| self._lru_order.append(expert_id) | ||
| self.misses += 1 |
There was a problem hiding this comment.
The current LRU cache implementation uses a list for _lru_order, which results in O(N) complexity for remove() and pop(0) operations, where N is the cache capacity. This can become a performance bottleneck for larger cache sizes.
To improve performance to O(1) for these operations, I recommend refactoring the LRU logic to use collections.OrderedDict.
This would involve the following changes:
-
In
__init__, change_lru_orderto anOrderedDict:from collections import OrderedDict # ... # LRU state (Python-only; must stay outside torch.compile). self._expert_to_slot: dict[int, int] = {} self._free_slots: list[int] = list(range(capacity)) # Front = least-recently-used expert ID. self._lru_order: OrderedDict[int, None] = OrderedDict()
-
Update the
preparemethod to useOrderedDictmethods for efficient LRU management, as shown in the suggestion below.
| for expert_id in unique_ids: | |
| if expert_id in self._expert_to_slot: | |
| self._lru_order.remove(expert_id) | |
| self._lru_order.append(expert_id) | |
| self.hits += 1 | |
| else: | |
| if self._free_slots: | |
| slot = self._free_slots.pop() | |
| else: | |
| evicted = self._lru_order.pop(0) | |
| slot = self._expert_to_slot.pop(evicted) | |
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | |
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | |
| self._expert_to_slot[expert_id] = slot | |
| self._lru_order.append(expert_id) | |
| self.misses += 1 | |
| for expert_id in unique_ids: | |
| if expert_id in self._expert_to_slot: | |
| self._lru_order.move_to_end(expert_id) | |
| self.hits += 1 | |
| else: | |
| if self._free_slots: | |
| slot = self._free_slots.pop() | |
| else: | |
| evicted, _ = self._lru_order.popitem(last=False) | |
| slot = self._expert_to_slot.pop(evicted) | |
| self._buf_w13[slot].copy_(self._cpu_w13[expert_id]) | |
| self._buf_w2[slot].copy_(self._cpu_w2[expert_id]) | |
| self._expert_to_slot[expert_id] = slot | |
| self._lru_order[expert_id] = None | |
| self.misses += 1 |
There was a problem hiding this comment.
Fixed in 8fc9268 — replaced list-based _lru_order with collections.OrderedDict. move_to_end() for hits and popitem(last=False) for eviction are both O(1).
alvinttang
left a comment
There was a problem hiding this comment.
This is a well-designed feature — the LRU expert cache is a natural approach for running MoE models that exceed GPU memory. The implementation is clean and the code is well-documented. Here's a detailed review:
1. Thread safety concern in ExpertLRUCache.prepare()
The prepare() method mutates _expert_to_slot, _free_slots, and _lru_order without synchronization. In vLLM's current architecture, the forward pass is single-threaded on the model runner, so this is fine today. But if vLLM ever moves to concurrent forward passes (e.g., disaggregated prefill/decode with shared model weights), this would race. Worth a comment noting the single-threaded assumption.
2. Synchronous H2D copies in prepare() are a latency bottleneck
Each cache miss does a synchronous copy_() from CPU pinned memory to GPU. For large expert weights (e.g., DeepSeek-V2's 160 experts with ~7M params each), a miss could take 1-2ms per expert. If multiple misses occur in one forward pass (common with top-k=6 routing), this serialized copy could add 5-10ms per layer.
Consider using torch.cuda.Stream for async H2D copies with an event-based sync, or batching all misses into a single torch.cat + copy. The current approach is correct but may significantly impact throughput in practice.
3. The mapping tensor in prepare() is recreated every call
mapping = torch.zeros(self._num_experts, dtype=torch.int64)
for expert_id, slot in self._expert_to_slot.items():
mapping[expert_id] = slot
mapping = mapping.to(device=topk_ids.device)This allocates a new CPU tensor, fills it with a Python loop, and transfers it to GPU on every forward pass. For a model with 160 experts and 60+ layers, this adds up. Consider keeping a persistent _mapping tensor on GPU and only updating the changed entries in-place.
4. _forward_with_expert_cache bypasses several runner features
The cache forward path calls fused_experts() directly, bypassing the normal runner's handling of:
w13_bias/w2_bias(MoE layers with bias)- Expert-parallel scatter/gather
- Scale tensors for quantized weights (
w13_weight_scale,w2_weight_scale) - Custom activation functions beyond
self.activation
The EP and quantization incompatibilities are documented, but the bias case isn't mentioned. If any MoE model uses bias terms, this path would silently produce wrong results.
5. Missing enforce_eager validation
The docstring says --enforce-eager is required, but I don't see validation that rejects moe_expert_cache_size > 0 when enforce_eager=False. The @torch.compiler.disable decorator on _forward_with_expert_cache helps, but if CUDA graphs are used at a higher level, the dynamically changing buffer contents would cause correctness issues. Consider adding a config validator that errors out if moe_expert_cache_size > 0 and not enforce_eager.
6. Memory accounting
When expert weights are allocated on CPU pinned memory, vLLM's GPU memory profiler won't account for them. This means gpu_memory_utilization calculations will over-estimate available KV cache memory by the amount of expert weight memory that was moved to CPU. The profiler may need to be made aware of the CPU pinned allocation to avoid OOM during KV cache allocation.
7. Tests are good but limited
The correctness test (compare_two_settings) verifies output token matching, which is the most important thing. Consider also testing:
- Cache hit/miss counters (to verify the LRU logic is working)
- Edge case:
cache_size >= num_experts(all experts fit, no eviction) - Edge case:
cache_size = 1(maximum eviction pressure)
Overall this is a solid first implementation of MoE expert offloading. The main production concerns are the synchronous H2D copy latency and the missing enforce_eager validation.
|
Thanks for the thorough review @alvinttang! Addressing each point: 1. Thread safety — Added a comment in 2. Synchronous H2D copies — Agreed, this is the main latency bottleneck. Async H2D with double-buffered CUDA streams (the "DBO scheduling" from RFC #33869) is the top item in the planned PR 2. Mentioning it here so it's on record. 3. Persistent mapping tensor — Implemented in 68c81df. 4. Bias bypass — Guard added in 68c81df: 5. enforce_eager guard — In the code since 68c81df. From if self._moe_expert_cache_size > 0 and (
not vllm_config.model_config.enforce_eager
):
logger.warning(
"moe_expert_cache_size requires --enforce-eager; ..."
)
self._moe_expert_cache_size = 0The cache is silently disabled (not just warned) when 6. Memory accounting — Valid concern. The GPU profiler won't see CPU-pinned allocations, so it will over-allocate KV cache against memory that expert weights no longer occupy. This is actually a benefit (more KV cache headroom), not a hazard — the expert weights are no longer on GPU. But you're right that if someone relies on 7. Tests — 16 unit tests in |
4db08e9 to
618392a
Compare
|
Hi @e1n00r, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
29afd27 to
6af6bba
Compare
|
Hi @e1n00r, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Also check this paper: https://arxiv.org/html/2410.17954v1 Instead of LRU, they load with a predictor: "ExpertFlow consists of three key components: the Routing Path Predictor, the Expert Cache Engine, and the Token Scheduler. Leveraging the three synergistic components of our system, ExpertFlow achieves an average GPU memory savings of 75.4%, with peak savings reaching up to 93.72%, compared to GPU-only solutions. Furthermore, ExpertFlow attains an expert cache hit ratio of up to 91.96%, improving the hit ratio by an average of 27.65% over the LRU caching strategy. Additionally, ExpertFlow delivers a 2 to 10 times increase in inference speed." |
If I do that we just made powerinfer again, which is a well established solution in its own right. |
|
Well, apparently there's quite a few options here:
But these are not better than Predictor-based systems (e.g., ProMoE, ExpertFlow) and Learned replacement (e.g., FlashMoE ML policy). Strong non-ML alternatives: Of course, that's all for another PR, it's important to at least get this caching strategy ball rolling - the possible speedups seem to be massive. Maybe it would be nice to make the strategy pluggable? |
Standard LRU lets early layers monopolize the cache because they execute first every forward pass. LFRU tracks per-expert access frequency (decayed) and evicts the expert with lowest score = freq / (1 + recency). On GPT-OSS-20B: deep-layer hit rate improved from 0-8% to 52-94%. Critical for models with 128 experts/layer (Gemma 4, Nemotron). LFRUCachedWeightProvider is a drop-in replacement for CachedWeightProvider. Ref: vllm-project#37190 (e1n00r LFRU findings) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemma 4 26B-A4B-it validation (128 experts × top-8, 30 layers)Tested the rebased fork on google/gemma-4-26B-A4B-it — a very different MoE architecture from Nemotron:
Results (cache_size=8, RTX PRO 6000 Blackwell)
Required fixes for Colab/JupyterHad to add a try/except in Also quantized with PolarQuant Q5Quantized all 7,680 MoE expert weights (3D Note on LFRULooking forward to testing the LFRU policy on this model. With 128 experts × 30 layers and top-8 routing, the deep-layer cache starvation you described for GPT-OSS-20B should be even more pronounced here. I've implemented an initial LFRU version in the fork (commit 7a19e4d) — happy to coordinate on this. |
|
@caiovicentino — the LFRU results confirm what we measured independently. We have LFRU and 6 other policies implemented and benchmarked in tinyserve. Based on your validation (LFRU cache=8 exceeds LRU cache=16 in hit rate) and our own benchmarks (deep-layer starvation eliminated, +5-50% throughput), we've updated the PR to ship LFRU as the built-in eviction policy instead of LRU. No config field, no pluggable framework — just the better algorithm baked in. Keeps the PR focused. The Gemma 4 validation (14.8 tok/s, 8.62 GB) is strong — two models on two architectures with correct output. For the rebase: we've rebased onto current main using your conflict resolutions (co-authored). One clean commit, 14 files, no workflow changes. For experimentation: if you want to try ideas freely — buddy substitution, CPU-on-miss, dynamic VRAM rebalancing, imatrix cache seeding, per-layer cache budgets — tinyserve is the playground. It's ~7K LOC Python with 340 tests, all of these features implemented and benchmarked. Alternatively, we could set up a shared experimental branch on the vLLM fork for testing concepts before they go into a PR. Either way works — happy to coordinate. |
73a7584 to
f3e6781
Compare
e07f615 to
71ed1fc
Compare
71ed1fc to
73f9f89
Compare
|
@mgoin — could you add the |
|
This pull request has merge conflicts that must be resolved before it can be |
… Nemotron fixes Cherry-picked and resolved conflicts for all 7 commits from e1n00r/vllm@feature/moe-expert-lru-cache onto vllm-project/vllm main. Resolved conflicts in: - layer.py: merged expert cache init with current __init__ structure - unquantized_fused_moe_method.py: merged provider check with current API - fp8.py: added cache init call - offload.py: added moe_expert_cache_size field Tested on Nemotron-Cascade-2-30B-A3B (RTX PRO 6000 Blackwell): - cache=8: 15.6 tok/s, correct output - cache=16: 19.6 tok/s - cache=32: 24.4 tok/s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard LRU lets early layers monopolize the cache because they execute first every forward pass. LFRU tracks per-expert access frequency (decayed) and evicts the expert with lowest score = freq / (1 + recency). On GPT-OSS-20B: deep-layer hit rate improved from 0-8% to 52-94%. Critical for models with 128 experts/layer (Gemma 4, Nemotron). LFRUCachedWeightProvider is a drop-in replacement for CachedWeightProvider. Ref: vllm-project#37190 (e1n00r LFRU findings) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@e1n00r — We rebased the expert offloading code on top of the latest upstream/main (70406eb, April 7th) and confirmed it works end-to-end with a CompressedTensors INT4 MoE model. Test results (Qwopus-MoE-35B-A3B INT4 CT, RTX PRO 6000 Blackwell 102 GB):
The LFRU cache is 1.58x faster than all-in-GPU — better memory locality with 8 hot experts vs 256 scattered. Coherent output confirmed (not garbage). Our rebased fork: https://github.com/caiovicentino/vllm-expert-offload (branch The rebase resolved one conflict in Co-authored-by: Caio Vicentino caiovicentino@users.noreply.github.com |
|
Benchmark charts for the INT4 MoE test above: Model card with full details: https://huggingface.co/caiovicentino1/Qwopus-MoE-35B-A3B-PolarQuant-Q5 PPL 6.56 on full WikiText-2 (295K tokens) — virtually identical to BF16 baseline. The LFRU expert cache with only 8 hot experts is faster than loading all 256 into GPU. |
|
Hi @e1n00r @caiovicentino ! Thank you for your excellent work and test results. I have a small problem though: when the cache is overflowed during prefill, the current behaviour is to keep the suffix of the required expert list, which consists of ones with the greatest ids since the list sorted by |
…ia expert CPU offloading Expert weights live in CPU pinned memory; a GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. On cache hit: zero-copy GPU forward from fixed-address buffer. On miss: synchronous H2D copy, expert remapped to cache slot. CLI: --moe-expert-cache-size N --enforce-eager Config: OffloadConfig.moe_expert_cache_size Tested on OLMoE-1B-7B (8 GB GPU), Nemotron-Cascade-2-30B-A3B (7.6 GB VRAM, 15+ tok/s), and Gemma-4-26B-A4B-it (8.6 GB VRAM, 14.8 tok/s). LFRU validated independently on Nemotron: cache=8 LFRU exceeds cache=16 LRU in hit rate. RFC: vllm-project#38256 Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so> Co-authored-by: Caio Vicentino <caiovicentino@Mac.lan> Co-authored-by: Claude <noreply@anthropic.com>
- expert_weight_provider.py: assert best_key is not None before dict.pop() (best_key is set by the loop which runs when _lru is non-empty) - unquantized_fused_moe_method.py: assert experts_cls is not None before make_unquantized_moe_kernel() in cache-active path (mirrors line 193) - ruff format: layer.py, unquantized_fused_moe_method.py, test_expert_lru_cache.py Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so>
30bdcda to
1844fbb
Compare
…ent truncation Prefill batches can activate more unique experts than gpu_capacity. The previous code kept the highest-ID experts (by unique() sort order), which is arbitrary and produces silently incorrect outputs: tokens routed to dropped experts compute with stale slot weights from a different expert. Replace with a hard RuntimeError pointing users to --moe-expert-cache-size. Chunked-prefill-aware loading (sub-batch within capacity) is the correct long-term fix and will come in a follow-up PR. Fixes concern raised by @tkj666 in PR review. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Elnur Abdullaev <elnur.abdullaev@sonia.so>
1844fbb to
1a8df90
Compare
|
@tkj666 great catch — this is a real limitation and worth being explicit about. You're correct: Why it ended up this wayDuring prefill there's no LRU history yet — no prior tokens have touched the experts, so we have zero signal about which to keep. We defaulted to "keep the suffix of The real issueFor short prefills (
Better policy: prefill-access-order LRUThe right fix is tracking expert access order during prefill in a small ring buffer and using that as the initial LRU state when decode begins. Turns "arbitrary prefill state" into "actual recency from prefill traffic" at near-zero cost. Benchmark planWe want to quantify the quality impact before picking a policy. We're planning to compare on a long-prompt workload with
Thanks for flagging this — it matters for long-context / small-cache workloads exactly like you describe. Will follow up on this thread when we have numbers. |
|
@tkj666 @caiovicentino — thanks both. The fix is now in 1a8df90. Digging into it we found the suffix-of-
Given that, any truncation policy (suffix / prefix / random / prefill-access-order) has the same defect unless we also fence off the dropped experts in if len(unique_ids) > self.capacity:
raise RuntimeError(
f"CachedWeightProvider: {len(unique_ids)} unique experts requested "
f"but --moe-expert-cache-size={self.capacity}. "
f"Set --moe-expert-cache-size >= {len(unique_ids)}."
)The prefill-LRU seeding @caiovicentino sketched is the right direction for a follow-up PR alongside chunked-prefill-aware loading — once the kernel has a way to skip tokens whose expert is not resident, the policy question becomes meaningful. Until then, "fit all unique experts in capacity, or error" is the only correct option. Practical guidance for hitting this: either set |
|
@e1n00r — I think that we can make |
Empirical validationTested "partition experts into N groups, call The constraint: use
|
| Configuration | Per layer | Bottleneck |
|---|---|---|
| Persistent full-cache (capacity=32) | 6.5 ms | gemm compute |
| Mini-EP 2/4/8 groups, cold full swap (cap=16/8/4) | ~100 ms | PCIe transfer, ~7 GB/s observed, ~3.5 ms/expert |
| Mini-EP 2 groups, warm cache (cap=28), 2 new experts | ~17 ms | PCIe transfer (partial) + extra kernel call |
| Mini-EP 2 groups, warm cache, 0 new experts | ~12 ms | extra kernel call overhead |
Kernel launch overhead is secondary: +6% for 2 groups, +15% for 4, +35% for 8 at M=1024 prefill (5-6 Triton launches per call × ~100 μs each). The 15x slowdown in the cold-swap case is entirely PCIe bandwidth — total bytes transferred are identical regardless of how you partition the groups.
Practical consequence for the design: mini-EP is the correct fallback when VRAM cannot fit all experts, and --moe-expert-cache-size >= num_experts remains the recommended path whenever it fits. The overhead grows linearly with the number of new experts per layer, so the design should prefer a high cache hit rate over a clever grouping strategy.
Scope: PR2 alongside async H2D
Holding this out of PR1 for three reasons:
-
PR2 already restructures
prepare()and the forward loop for the async copy stream. Mini-EP and async prefetch share the same API surface (prepare_chunked()with optional pre-fetched slots), so designing them together avoids an architectural carve-out that a PR1 bolt-on would require. -
Routing mini-EP through
fused_experts_implmeans the provider path diverges from the modular kernel infrastructure for shared_experts / quant_config / prepare_finalize dispatch. That's a deliberate architectural choice deserving its own design discussion with @mgoin, rather than a late addition to a month-old PR. -
Two open questions require real-model runs:
- Does
try_get_optimal_moe_configcache-hit across groups when effectivenum_valid_tokenschanges per group? A miss means autotune re-runs every group, destroying first-forward latency. - Does bf16 accumulation drift over 30-60 layers × hundreds of forward passes stay within
atol=2e-2? Microbenchmark drift is at the mantissa floor; real-model drift under repeated reductions is untested.
- Does
Happy to prototype this on a follow-up branch after PR1 merges — your framing ("mini-EP with each replica executing one by one on the same GPU") is the right mental model, and the two non-obvious implementation points are (a) the non-modular routing required to avoid the workspace aliasing, and (b) hoisting unique() out of the per-group loop to pay the D2H sync once per forward rather than N times.
|
Thanks for the rigor on this — the workspace aliasing finding alone is the kind of latent bug that would survive unit tests and only surface in real-model accumulation. Calling it a correctness defect rather than a "be careful" footnote is the right framing. A few notes on each section: Empirical validation. The bitwise-equal-at-small / mantissa-floor-at-large pattern is exactly the equivalence story you can defend in a PR description without hedging.
PCIe dominance. Matches what we saw on Nemotron-Cascade-2-30B-A3B and Gemma-4-26B-A4B. The LFRU validation reproduced the same early-layer-hot / deep-layer-starved pattern you described on GPT-OSS-20B, and the hit-rate gains were the dominant lever — bigger than any grouping cleverness we tried. Your 15x cold-swap measurement reinforces that "fit, or fail fast" is the only honest PR1 contract. PR2 scope. Agree on holding it out. On the two open questions:
Happy to do both on whatever branch you set up after PR1 merges. Ping me when there's a target commit. |

Purpose
CachedWeightProvider— MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.
No runner bypass — all paths go through
quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.References: RFC #38256 | tinyserve (production validation, 481 tests)
Test results
Community validation (independent):
LFRU vs LRU (Nemotron, cache=8): LFRU cache=8 exceeds LRU cache=16 in hit rate. +5.2% speed improvement.
Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).
Changes
15 files, ~810 additions
expert_weight_provider.py(new)CachedWeightProviderwith LFRU eviction,ExpertWeightResultdataclassfused_moe_method_base.pysupports_expert_lru_cacheproperty (default False)fused_moe_modular_method.pyapply()layer.py_maybe_init_expert_lru_cache(),expert_weight_providerattributeunquantized_fused_moe_method.pyquantization/fp8.pysupports_expert_lru_cache, provider check, cache initoffload.pymoe_expert_cache_sizeconfig fieldvllm.pyarg_utils.py--moe-expert-cache-sizellm.pymoe_expert_cache_sizeparameter inLLM.__init__basic_correctness.yamldocs/features/moe_cache_policies.md(new)test_expert_lru_cache.py(new)test_moe_expert_cache.py(new)compare_two_settingsbenchmarks/qwen_122b_test_20260331.txt(new)How it works
Limitations
--enforce-eagerrequired (CUDA graph compat deferred to PR 2)Test plan
AI-assisted development (Claude Code). Architecture validated in tinyserve.